Basic Plots

Using mtcars, a data set built into base R, we will display various plots.

Built-in R Functions

Scatter Plot (Single Variable)

plot(mtcars$mpg)

The above plot uses the mpg variable of the mtcars data set. When a single variable is provided, its values are plotted on the Y-axis against their index (row position) on the X-axis. The plot is not especially informative, but it shows how little code R needs to produce a visualization.

Scatter Plot (Two Variables)

A scatter plot of two variables can give insight into the relationship between them. Below is a scatter plot of wt versus mpg (weight vs. miles per gallon), with wt on the X-axis.

plot(mtcars$wt, mtcars$mpg)

Looking at the scatter plot above, it looks as if the miles per gallon of a car decreases as the weight of the car increases.

When two variables are provided, the first becomes the X-axis and the second becomes the Y-axis by default. We can also name the arguments explicitly, in which case their order no longer matters. For example, plot(y = mtcars$mpg, x = mtcars$wt) still places wt on the X-axis and mpg on the Y-axis, even though y is given first.
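As a short illustration of the argument matching described above, the three calls below should all draw the same plot. The formula form in the last call is not used elsewhere in this document; it is included only as an alternative spelling.

```r
# Positional: the first vector goes on the X-axis, the second on the Y-axis.
plot(mtcars$wt, mtcars$mpg)

# Named: the order of the arguments no longer matters.
plot(y = mtcars$mpg, x = mtcars$wt)

# Formula interface: read as "mpg as a function of wt".
plot(mpg ~ wt, data = mtcars)
```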

Line Chart (Single Variable)

By default, plot() creates a scatter plot, but passing a different value to the type parameter produces a different kind of plot. For example, setting type = "l" draws a line chart instead.

plot(mtcars$mpg, type = "l")

Line Chart (Two Variables)

plot(mtcars$wt, mtcars$mpg, type = "l")

As we can see above, a given chart type does not always suit the data. The line connects the points in the order the observations appear in the data set, not in order of weight, so it zigzags across the plot. A linear regression would summarize the relationship better and could also be used to make predictions. Linear regression will be discussed in another reference document.
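One way to see the effect of observation order is to sort the rows by weight before drawing the line. This is only a sketch; cars_by_wt is a name introduced here for illustration.

```r
# Sort the rows by weight so the line moves from left to right.
cars_by_wt <- mtcars[order(mtcars$wt), ]

# Same line chart as before, but drawn through the sorted observations.
plot(cars_by_wt$wt, cars_by_wt$mpg, type = "l")
```

The sorted line no longer doubles back on itself, although it is still jagged because cars of similar weight can differ in mpg.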

Bar Plot

The bar plot will use the cyl variable from the mtcars data set. cyl is the number of cylinders in each car.

barplot(table(mtcars$cyl))

The table() function was used to count how many cars have each number of cylinders. Let’s try using it on its own.

table(mtcars$cyl)
## 
##  4  6  8 
## 11  7 14

There are 11 cars with 4 cylinders, 7 cars with 6 cylinders, and 14 cars with 8 cylinders.

Histogram

A histogram can give us some insight into the distribution of the data.

hist(mtcars$mpg)

We can see that more cars in the mtcars data set fall in the 15 to 20 miles per gallon range than in any other bin, while very few fall between 25 and 30 miles per gallon.
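The bin counts behind the histogram can be inspected directly by asking hist() to return its result without drawing. The variable name h is introduced here for illustration.

```r
# Compute the histogram without plotting it.
h <- hist(mtcars$mpg, plot = FALSE)

# Bin boundaries chosen by the default rule.
h$breaks
## [1] 10 15 20 25 30 35

# Number of cars in each bin.
h$counts
## [1]  6 12  8  2  4
```

The 15 to 20 bin holds 12 of the 32 cars, which makes it the largest bin but not a majority of the data.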

Box Plot

We can use a box plot to display the range, first quartile, median, and third quartile of a variable. Below is a box plot of the mpg variable of the mtcars data set. The horizontal argument is set to TRUE to display the box plot horizontally rather than vertically, which is the default.

boxplot(mtcars$mpg, horizontal = TRUE)
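The values that the box plot draws can also be computed directly. The sketch below uses quantile() and boxplot.stats(); box_stats is a name introduced here. Note that boxplot() is based on hinges, which can differ slightly from the quartiles reported by quantile().

```r
# Minimum, quartiles, and maximum using the default quantile method.
quantile(mtcars$mpg)

# The five values boxplot() actually draws:
# lower whisker, lower hinge, median, upper hinge, upper whisker.
box_stats <- boxplot.stats(mtcars$mpg)$stats
box_stats
```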

Customizing the Plots

The following will use the scatter plot of wt versus mpg.

The plot can be customized to suit your preferences, for example by adding a title and axis labels or by changing the color of the data markers, title, or labels.
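A minimal sketch of such customization follows. The title text, label text, and color choices are assumptions made for illustration, not the values used in the original figure.

```r
plot(
  mtcars$wt, mtcars$mpg,
  main = "Miles per Gallon vs. Weight",  # plot title
  xlab = "Weight (1000 lbs)",            # X-axis label
  ylab = "Miles per gallon",             # Y-axis label
  col = "blue",                          # color of the data markers
  pch = 19,                              # solid circles instead of open ones
  col.main = "darkred",                  # color of the title
  col.lab = "darkgreen"                  # color of the axis labels
)
```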

Base R provides us with enough to make plots with simple customization. Next, we will showcase ggplot2, a package from the tidyverse collection that will allow us to have further control over plot customization.

ggplot2 for Visualization

The ggplot2 package allows us to build visualizations in layers: we create a base plot and then add to it. For example, we will first create a simple scatter plot of wt versus mpg using ggplot2 and then add on to it. Before creating the plot, we first need to load the required library.

library(ggplot2)

ggplot2 Scatter Plot

After loading the ggplot2 package, we can start creating the base of a scatter plot.

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()

We used the ggplot() function to create the plot and provided mtcars as the data argument. The variables mapped to the x- and y-axes are given in the aes() portion of the code. With that, ggplot knows which data set it is working with and which variables to use. The + symbol adds a layer to the plot. In the example above, geom_point() draws each observation as a point, which produces a scatter plot.

Comparing the simple ggplot2 scatter plot to the one made with base R, we can already see that it looks a lot better. Let’s continue making it look even better.

ggplot2 Scatter Plot With Regression Line

Note that we can keep adding features to the plot by chaining more + operators, each followed by the desired layer. For example, let’s add a linear regression line to the scatter plot by using geom_smooth(method = "lm").
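A minimal sketch of that addition, building on the base scatter plot from the previous section. The original figure may have used extra styling (such as custom colors) that is not reproduced here, and p_lm is a name introduced for illustration.

```r
library(ggplot2)

# Scatter plot plus a fitted linear regression line.
p_lm <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm")

p_lm
```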

## `geom_smooth()` using formula = 'y ~ x'

The line that runs through the scatter plot is the linear regression line, which summarizes the relationship between the two variables. This line can also be used to make predictions on new data. The shaded band around the line (dark green in the original figure) displays the confidence interval for the regression line. It is displayed because the se parameter defaults to TRUE. You can turn it off using se = FALSE.

ggplot2 Bar Plot

Because the cyl variable takes only three distinct values, it should be treated as categorical, so we convert it to a factor for the x argument. We will further customize the plot by giving each bar its own color.
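A sketch of the bar plot described above, assuming the default fill palette. The axis and legend labels are illustrative, and p_bar_cyl is a name introduced here.

```r
library(ggplot2)

# factor(cyl) makes cyl categorical; mapping it to fill colors each bar.
p_bar_cyl <- ggplot(data = mtcars, aes(x = factor(cyl), fill = factor(cyl))) +
  geom_bar() +
  labs(x = "Cylinders", y = "Number of cars", fill = "Cylinders")

p_bar_cyl
```

geom_bar() counts the rows in each category itself, so there is no need to call table() first as in the base R version.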

ggplot2 Histogram

We can see that this histogram looks similar to the one using the base R function hist(), but a lot nicer.
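A histogram like the one described might be produced as follows. The breaks are set to match the bins that hist() chooses for mpg (10 to 35 in steps of 5); the white bar outline is a styling assumption, and p_hist is a name introduced here.

```r
library(ggplot2)

# Histogram of mpg with the same bin edges as the base R version.
p_hist <- ggplot(data = mtcars, aes(x = mpg)) +
  geom_histogram(breaks = seq(10, 35, by = 5), color = "white")

p_hist
```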

ggplot2 Box Plot

We can also apply a color filling by categorical group. For example, the following plot will display box plots for mpg but separated by the number of cylinders.
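A minimal sketch of the grouped box plot, assuming the default fill palette. The labels are illustrative, and p_box is a name introduced here.

```r
library(ggplot2)

# One box per cylinder count, each filled with its own color.
p_box <- ggplot(data = mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot() +
  labs(x = "Cylinders", y = "Miles per gallon", fill = "Cylinders")

p_box
```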

We could also have applied a facet grid.
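One possible facet version is sketched below. Splitting the panels by cyl is an assumption, since the text does not say which variable the original facet used; p_facet is a name introduced here.

```r
library(ggplot2)

# One panel per cylinder count, each containing a box plot of mpg.
p_facet <- ggplot(data = mtcars, aes(y = mpg)) +
  geom_boxplot() +
  facet_grid(cols = vars(cyl))

p_facet
```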

Map Visualization

Choropleth Map

Using the map_data() function from the ggplot2 package, we will create a map of the United States in which each state is filled in with its own color.
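A minimal sketch of such a map follows. It assumes the maps package is installed, because map_data() reads its outlines from there. The white borders and hidden legend are styling assumptions, and us_states and p_map are names introduced here.

```r
library(ggplot2)

# Polygon outlines for the contiguous United States (requires the maps package).
us_states <- map_data("state")

# Each state is one or more polygons; group keeps their points together.
p_map <- ggplot(us_states, aes(x = long, y = lat, group = group, fill = region)) +
  geom_polygon(color = "white") +
  coord_quickmap() +
  theme(legend.position = "none")

p_map
```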

The map above conveys little information on its own, but it serves as a starting template for geospatial data (here, the United States). The following map visualization uses a data set provided by CSU East Bay, STAT 541 - Intro. Data Visualization. First, let’s take a brief look at the data set.

library(dplyr)
library(tibble)

Note: dplyr is part of the tidyverse collection and is a great package for data manipulation, thanks to its consistent, verb-based grammar.

## # A tibble: 1,269 × 17
##        id name   city  state region highest_degree control gender admission_rate
##     <int> <chr>  <chr> <chr> <chr>  <chr>          <chr>   <chr>           <dbl>
##  1 102669 Alask… Anch… AK    West   Graduate       Private CoEd            0.421
##  2 101648 Mario… Mari… AL    South  Associate      Public  CoEd            0.614
##  3 100830 Aubur… Mont… AL    South  Graduate       Public  CoEd            0.802
##  4 101879 Unive… Flor… AL    South  Graduate       Public  CoEd            0.679
##  5 100858 Aubur… Aubu… AL    South  Graduate       Public  CoEd            0.835
##  6 100663 Unive… Birm… AL    South  Graduate       Public  CoEd            0.857
##  7 101480 Jacks… Jack… AL    South  Graduate       Public  CoEd            0.833
##  8 102049 Samfo… Birm… AL    South  Graduate       Private CoEd            0.595
##  9 101709 Unive… Mont… AL    South  Graduate       Public  CoEd            0.743
## 10 100751 The U… Tusc… AL    South  Graduate       Public  CoEd            0.510
## # ℹ 1,259 more rows
## # ℹ 8 more variables: sat_avg <int>, undergrads <int>, tuition <int>,
## #   faculty_salary_avg <int>, loan_default_rate <chr>, median_debt <dbl>,
## #   lon <dbl>, lat <dbl>

Dimension: \(1,269 \times 17\).

Variables:

  • id (ID number)
  • name (Name of college)
  • city (City)
  • state (State)
  • region (Midwest/Northeast/South/West)
  • highest_degree (Associate/Bachelor/Graduate)
  • control (Public/Private)
  • gender (CoEd/Women/Men)
  • admission_rate (Admission Rate)
  • sat_avg (SAT average)
  • undergrads (Number of undergraduates)
  • tuition (Tuition Cost)
  • faculty_salary_avg (Average salary of faculty)
  • loan_default_rate (Rate of loan defaults)
  • median_debt (Median debt)
  • lon (Longitude)
  • lat (Latitude)

Next, we correct the data types of the variables and aggregate the data to count the number of colleges within each state and region.

# Correcting the data type of the variables.
college <- college %>%
  mutate(
    state = as.factor(state),
    region = as.factor(region),
    highest_degree = as.factor(highest_degree),
    control = as.factor(control),
    gender = as.factor(gender),
    loan_default_rate = as.numeric(loan_default_rate)
  )

college_summary <- college %>% 
  group_by(state, region) %>% 
  summarise("School Count" = n())
college_summary
## # A tibble: 51 × 3
## # Groups:   state [51]
##    state region    `School Count`
##    <fct> <fct>              <int>
##  1 AK    West                   1
##  2 AL    South                 24
##  3 AR    South                 16
##  4 AZ    West                   6
##  5 CA    West                  71
##  6 CO    West                  14
##  7 CT    Northeast             14
##  8 DC    South                  6
##  9 DE    South                  3
## 10 FL    South                 36
## # ℹ 41 more rows

Using the data, we will plot a map of the United States with each state being color-coded based on the number of colleges it has.
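The sketch below shows one way such a choropleth could be built. Because the college data set is not bundled with this document, it uses a small stand-in data frame (school_counts, a hypothetical name) holding only the counts visible in the printed summary above. The color scale is also an assumption.

```r
library(dplyr)
library(ggplot2)

# Stand-in for college_summary: counts copied from the printed output above.
school_counts <- data.frame(
  state = c("AL", "AR", "AZ", "CA", "CO", "CT", "DE", "FL"),
  school_count = c(24, 16, 6, 71, 14, 14, 3, 36)
)

# map_data() names states in lowercase ("alabama"), so translate the
# abbreviations using the built-in state.abb and state.name vectors.
school_counts$region <- tolower(state.name[match(school_counts$state, state.abb)])

# Attach the counts to every polygon point; unmatched states get NA.
map_counts <- map_data("state") %>%
  left_join(school_counts, by = "region")

p_choro <- ggplot(map_counts, aes(x = long, y = lat, group = group, fill = school_count)) +
  geom_polygon(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "darkblue", na.value = "grey90") +
  coord_quickmap() +
  labs(fill = "School Count")

p_choro
```

Two caveats apply to this approach: state.abb has no entry for DC, and map_data("state") covers only the contiguous states, so AK would need separate handling with the full data set.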

Animation

A beautiful plot can give us a lot of information and can be used to identify relationships, trends, clusters, and more, but adding animation can bring the story to life.

We will use the gapminder data set from the gapminder package to visualize some animated plots. Let’s briefly look into the gapminder data set.

library(tibble)
library(gapminder)
as_tibble(gapminder)
## # A tibble: 1,704 × 6
##    country     continent  year lifeExp      pop gdpPercap
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.
##  2 Afghanistan Asia       1957    30.3  9240934      821.
##  3 Afghanistan Asia       1962    32.0 10267083      853.
##  4 Afghanistan Asia       1967    34.0 11537966      836.
##  5 Afghanistan Asia       1972    36.1 13079460      740.
##  6 Afghanistan Asia       1977    38.4 14880372      786.
##  7 Afghanistan Asia       1982    39.9 12881816      978.
##  8 Afghanistan Asia       1987    40.8 13867957      852.
##  9 Afghanistan Asia       1992    41.7 16317921      649.
## 10 Afghanistan Asia       1997    41.8 22227415      635.
## # ℹ 1,694 more rows

Dimension: \(1,704 \times 6\).

Variables:

  • country (Country)
  • continent (Continent)
  • year (Year)
  • lifeExp (Life Expectancy)
  • pop (Population)
  • gdpPercap (GDP per Capita)

The animations below require the gganimate package.

gganimate Scatter Plot

The size of each point reflects the population of the country. There is no meaning behind the colors of the data points. It can be seen that, over the years, life expectancy tends to increase as the log of GDP per capita increases.
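An animated scatter plot matching this description might be built as sketched below. Coloring by country with the legend hidden is an assumption chosen to fit the remark that the colors carry no meaning; the labels are illustrative, and p_scatter is a name introduced here.

```r
library(ggplot2)
library(gganimate)
library(gapminder)

# One point per country; transition_time() steps the plot through the years.
p_scatter <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, size = pop, color = country)) +
  geom_point(alpha = 0.7, show.legend = FALSE) +
  scale_x_log10() +
  labs(title = "Year: {frame_time}", x = "GDP per capita (log scale)", y = "Life expectancy") +
  transition_time(year)

# Rendering needs a renderer package such as gifski to be installed.
# animate(p_scatter)
```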

gganimate Line Chart

Clearly, the populations of the other continents pale in comparison to Asia’s.
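A sketch of an animated line chart of population by continent follows. The aggregation step and the names pop_by_continent, total_pop, and p_line are assumptions, since the original code is not shown. pop is converted to numeric before summing because continent totals exceed R's integer range.

```r
library(dplyr)
library(ggplot2)
library(gganimate)
library(gapminder)

# Total population per continent and year.
pop_by_continent <- gapminder %>%
  group_by(continent, year) %>%
  summarise(total_pop = sum(as.numeric(pop)), .groups = "drop")

# transition_reveal() draws each line progressively along the year axis.
p_line <- ggplot(pop_by_continent, aes(x = year, y = total_pop, color = continent)) +
  geom_line() +
  labs(x = "Year", y = "Population", color = "Continent") +
  transition_reveal(year)

# animate(p_line)
```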

gganimate Bar Chart With Shadow Mark
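The original figure for this section is not reproduced here, so the following is only one plausible reading of the heading: a bar per year of total world population, where shadow_mark() keeps earlier bars on screen as the animation advances. The summary, labels, and the names world_pop and p_bar are assumptions.

```r
library(dplyr)
library(ggplot2)
library(gganimate)
library(gapminder)

# Total population across all countries in the data set, per year.
world_pop <- gapminder %>%
  group_by(year) %>%
  summarise(total_pop = sum(as.numeric(pop)))

# Each state shows one year's bar; shadow_mark() leaves past bars behind.
p_bar <- ggplot(world_pop, aes(x = factor(year), y = total_pop)) +
  geom_col(fill = "steelblue") +
  labs(x = "Year", y = "Population") +
  transition_states(year, wrap = FALSE) +
  shadow_mark()

# animate(p_bar)
```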

gganimate Box Plot
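As with the previous section, the original figure is not available, so this is a hedged sketch rather than the author's plot: box plots of life expectancy by continent, animated from one year to the next with transition_states(). The labels and the name p_box_anim are assumptions.

```r
library(ggplot2)
library(gganimate)
library(gapminder)

# One box per continent; each state of the animation is one year.
p_box_anim <- ggplot(gapminder, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = "Year: {closest_state}", x = "Continent", y = "Life expectancy") +
  transition_states(year, transition_length = 2, state_length = 1)

# animate(p_box_anim)
```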

Ending Notes

Data visualization helps us see data in a way that is easy to understand. It allows us to see the story that the data is telling. We can derive information from it and use that to make better decisions. This reference document will continue to be updated as I learn more about data visualization.